7 research outputs found

    P3Depth: Monocular Depth Estimation with a Piecewise Planarity Prior

    Full text link
    Monocular depth estimation is vital for scene understanding and downstream tasks. We focus on the supervised setup, in which ground-truth depth is available only at training time. Based on knowledge about the high regularity of real 3D scenes, we propose a method that learns to selectively leverage information from coplanar pixels to improve the predicted depth. In particular, we introduce a piecewise planarity prior which states that for each pixel, there is a seed pixel which shares the same planar 3D surface with the former. Motivated by this prior, we design a network with two heads. The first head outputs pixel-level plane coefficients, while the second one outputs a dense offset vector field that identifies the positions of seed pixels. The plane coefficients of seed pixels are then used to predict depth at each position. The resulting prediction is adaptively fused with the initial prediction from the first head via a learned confidence to account for potential deviations from precise local planarity. The entire architecture is trained end-to-end thanks to the differentiability of the proposed modules, and it learns to predict regular depth maps with sharp edges at occlusion boundaries. An extensive evaluation of our method shows that we set the new state of the art in supervised monocular depth estimation, surpassing prior methods on NYU Depth-v2 and on the Garg split of KITTI. Our method delivers depth maps that yield plausible 3D reconstructions of the input scenes. Code is available at: https://github.com/SysCV/P3Depth
    Comment: Accepted at CVPR 2022
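
    The two-head mechanism described above can be sketched compactly: per-pixel plane coefficients give an initial depth, a dense offset field points each pixel to its seed, and a learned confidence fuses the two predictions. Below is a minimal, illustrative PyTorch sketch; the function names, the three-coefficient plane parameterization (inverse depth affine in normalized image coordinates), and the tensor layouts are assumptions for exposition, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def plane_to_depth(coeffs, grid):
    """Evaluate per-pixel plane coefficients (a, b, c) as inverse depth
    1/z = a*u + b*v + c at normalized image coordinates (u, v).
    coeffs: (B, 3, H, W), grid: (B, 2, H, W) -> depth (B, 1, H, W).
    (Illustrative parameterization, not necessarily the paper's.)"""
    a, b, c = coeffs[:, 0:1], coeffs[:, 1:2], coeffs[:, 2:3]
    u, v = grid[:, 0:1], grid[:, 1:2]
    inv_depth = a * u + b * v + c
    return 1.0 / inv_depth.clamp(min=1e-6)

def seed_depth(coeffs, offsets, grid):
    """Fetch plane coefficients at the seed position each pixel points to
    (dense offset field, assumed in normalized [-1, 1] units, x-then-y),
    then evaluate that seed plane at the current pixel's coordinates."""
    B, _, H, W = coeffs.shape
    ys = torch.linspace(-1.0, 1.0, H, device=coeffs.device, dtype=coeffs.dtype)
    xs = torch.linspace(-1.0, 1.0, W, device=coeffs.device, dtype=coeffs.dtype)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    base = torch.stack([gx, gy], dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
    sample = base + offsets.permute(0, 2, 3, 1)      # seed positions (B, H, W, 2)
    seed_coeffs = F.grid_sample(coeffs, sample, align_corners=True)
    return plane_to_depth(seed_coeffs, grid)         # grid: same (u, v) coords as above

def fuse(depth_init, depth_seed, confidence):
    """Adaptive fusion of the initial and seed-based predictions via a
    learned confidence map in [0, 1]."""
    return confidence * depth_seed + (1.0 - confidence) * depth_init
```

    Because the offsets and the confidence are themselves network outputs and every step above is differentiable, this kind of pipeline can be trained end-to-end, which matches the property stated in the abstract.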

    Don't Forget The Past: Recurrent Depth Estimation from Monocular Video

    Full text link
    Autonomous cars need continuously updated depth information. Thus far, depth is mostly estimated independently for a single frame at a time, even if the method starts from video input. Our method produces a time series of depth maps, which makes it an ideal candidate for online learning approaches. In particular, we put three different types of depth estimation (supervised depth prediction, self-supervised depth prediction, and self-supervised depth completion) into a common framework. We integrate the corresponding networks with a ConvLSTM such that the spatiotemporal structures of depth across frames can be exploited to yield a more accurate depth estimation. Our method is flexible. It can be applied to monocular videos only or be combined with different types of sparse depth patterns. We carefully study the architecture of the recurrent network and its training strategy. We are the first to successfully exploit recurrent networks for real-time self-supervised monocular depth estimation and completion. Extensive experiments show that our recurrent method outperforms its image-based counterpart consistently and significantly in both self-supervised scenarios. It also outperforms previous depth estimation methods of all three of these groups. Please refer to https://www.trace.ethz.ch/publications/2020/rec_depth_estimation/ for details.
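
    The core architectural idea, plugging a ConvLSTM between a per-frame encoder and a depth decoder so that state is carried across frames, can be illustrated as follows. This is a generic sketch under assumed interfaces (`encoder`, `decoder`, tensor shapes); it is not the paper's specific network or training strategy.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: all four gates come from one convolution over
    the concatenated input features and hidden state."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = f.sigmoid() * c + i.sigmoid() * g.tanh()   # update cell state
        h = o.sigmoid() * c.tanh()                     # new hidden state
        return h, (h, c)

def depth_from_video(encoder, cell, decoder, frames):
    """Unroll the recurrent cell over a clip: encode each frame, carry the
    spatiotemporal state forward, and decode one depth map per frame.
    frames: (B, T, 3, H, W) -> depths: (B, T, 1, H', W')."""
    state, depths = None, []
    for t in range(frames.shape[1]):
        feats = encoder(frames[:, t])
        if state is None:   # zero-initialize the recurrent state on the first frame
            h = torch.zeros(feats.shape[0], cell.hid_ch, *feats.shape[-2:],
                            device=feats.device, dtype=feats.dtype)
            state = (h, torch.zeros_like(h))
        h, state = cell(feats, state)
        depths.append(decoder(h))
    return torch.stack(depths, dim=1)
```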

    Improving depth learning with scene priors

    No full text
    Monocular depth estimation is vital for scene understanding and downstream tasks. It consists of predicting, for the 3D point depicted at each pixel, its coordinate along the axis perpendicular to the image plane. Applications range from robotics to autonomous cars. In this thesis, we use deep learning techniques to improve monocular depth estimation. In particular, a deep learning framework is applied to learn from various spatial and temporal priors on the scene geometry in order to solve the ambiguous and ill-posed problem of estimating depth from a single view. The thesis contributes to several depth learning paradigms, including depth estimation and completion, using both supervised and self-supervised deep learning algorithms.

    In the first contribution, we propose a novel supervised method that exploits the high regularity of real 3D scenes. The proposed method learns to selectively leverage information from coplanar pixels to improve the predicted depth. In particular, we introduce a piecewise planarity prior which states that for each pixel, there is a seed pixel that shares the same planar 3D surface with the former. Motivated by this prior, we design an end-to-end trainable network with two output heads. The first head outputs pixel-level plane coefficients, while the second one outputs a dense offset vector field that identifies the positions of seed pixels. The plane coefficients of seed pixels are then used to predict depth at each position. The resulting prediction is adaptively fused with the initial prediction from the first head via a learned confidence to account for potential deviations from precise local planarity. Overall, this method learns to predict regular depth maps with sharp edges at occlusion boundaries.

    The second contribution is a method that produces a time series of depth maps. In applications such as autonomous robots, depth is mostly estimated independently for a single frame at a time, even if the method starts from video input. The proposed method extends single-frame depth estimation to videos by integrating a ConvLSTM module into the training framework, so as to exploit the spatiotemporal structure of depth across frames for improved depth estimation. The flexibility of the proposed method is demonstrated across different depth learning paradigms, i.e., supervised depth prediction, self-supervised depth prediction, and self-supervised depth completion. The method can therefore be applied to monocular videos alone or in combination with different types of sparse depth patterns. Additionally, we highlight issues that arise when training a ConvLSTM-based network for dense prediction tasks with video inputs and propose a novel training strategy to mitigate them.

    Our third contribution is a self-supervised learning framework that estimates individual object motion and monocular depth from video. Here, we model object motion as a six-degree-of-freedom rigid-body transformation. Instance segmentation masks are leveraged to introduce object-level information. Compared with methods that predict dense optical flow maps to model motion, our approach significantly reduces the number of values to be estimated. Our system eliminates the scale ambiguity of motion prediction by imposing a novel geometric constraint loss term.

    In the fourth contribution, we extend the depth completion problem by using map-based depth data as an additional input instead of expensive depth sensors. Such an approach is especially appealing in autonomous driving, since map-based depth is commonly available from high-definition maps. To validate this approach, we propose a mapping method that works with common autonomous driving datasets and allows for precise localization using a mix of GNSS-INS and image-based techniques. Furthermore, we present an entirely learnable three-stage network that handles foreground-background mismatches between the map-based prior depth and the actual scene.
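
    As an illustration of the third contribution's motion model, the sketch below applies a per-object six-degree-of-freedom rigid-body transformation (axis-angle rotation plus translation, i.e. six values per object) to the 3D points selected by an instance mask, which is why far fewer values need to be estimated than with a dense optical flow field. All names and tensor layouts are hypothetical, and the thesis' geometric constraint loss is not reproduced here.

```python
import torch

def axis_angle_to_matrix(r):
    """Rodrigues' formula: axis-angle vector r (B, 3) -> rotation matrix (B, 3, 3)."""
    theta = r.norm(dim=1, keepdim=True).clamp(min=1e-8)
    k = r / theta                                   # unit rotation axis
    K = torch.zeros(r.shape[0], 3, 3, device=r.device, dtype=r.dtype)
    K[:, 0, 1], K[:, 0, 2] = -k[:, 2], k[:, 1]      # skew-symmetric matrix of k
    K[:, 1, 0], K[:, 1, 2] = k[:, 2], -k[:, 0]
    K[:, 2, 0], K[:, 2, 1] = -k[:, 1], k[:, 0]
    I = torch.eye(3, device=r.device, dtype=r.dtype).expand_as(K)
    s, c = theta.sin().unsqueeze(-1), theta.cos().unsqueeze(-1)
    return I + s * K + (1 - c) * (K @ K)

def move_object(points, mask, rot_vec, trans):
    """Apply a per-object 6-DoF rigid-body motion to the 3D points covered
    by its instance mask; background points are left untouched.
    points: (B, N, 3), mask: (B, N) in {0, 1}, rot_vec/trans: (B, 3)."""
    R = axis_angle_to_matrix(rot_vec)
    moved = points @ R.transpose(1, 2) + trans.unsqueeze(1)
    return torch.where(mask.unsqueeze(-1).bool(), moved, points)
```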

    Lidar Line Selection with Spatially-Aware Shapley Value for Cost-Efficient Depth Completion

    No full text
    Lidar is a vital sensor for estimating the depth of a scene. Typical spinning lidars emit pulses arranged in several horizontal lines, and the monetary cost of the sensor increases with the number of these lines. In this work, we present the new problem of optimizing the positioning of lidar lines to find the most effective configuration for the depth completion task. We propose a solution to reduce the number of lines while retaining high-quality depth completion. Our method consists of two components: (1) line selection based on the marginal contribution of a line, computed via the Shapley value, and (2) incorporation of the spatial spread of line positions to account for the image-wide coverage that depth completion requires. Spatially-aware Shapley values (SaS) succeed in selecting line subsets that yield a depth accuracy comparable to the full lidar input while using just half of the lines.
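
    To make the two components concrete, the sketch below estimates each line's Shapley value by Monte Carlo sampling of line orderings and then greedily selects lines by trading value off against vertical spread. The `utility` callable (e.g. depth-completion accuracy with a given line subset), the spread term, and the weighting are illustrative assumptions, not the paper's exact SaS formulation.

```python
import random
from statistics import mean

def shapley_values(lines, utility, n_perm=200, seed=0):
    """Monte Carlo estimate of each lidar line's Shapley value: its average
    marginal contribution to `utility(subset)` over random orderings of lines."""
    rng = random.Random(seed)
    contrib = {l: [] for l in lines}
    for _ in range(n_perm):
        order = lines[:]
        rng.shuffle(order)
        chosen, prev = [], utility([])
        for l in order:
            chosen.append(l)
            cur = utility(chosen)
            contrib[l].append(cur - prev)   # marginal contribution of line l
            prev = cur
    return {l: mean(v) for l, v in contrib.items()}

def select_lines(lines, values, k, spread_weight=0.5):
    """Greedy spatially-aware selection: score each candidate line by its
    Shapley value plus a bonus for distance (in row index) from the lines
    already picked, so the chosen subset spans the whole image vertically."""
    picked = []
    while len(picked) < k:
        def score(l):
            spread = min((abs(l - p) for p in picked), default=0)
            return values[l] + spread_weight * spread
        best = max((l for l in lines if l not in picked), key=score)
        picked.append(best)
    return sorted(picked)
```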